library(ggplot2)
library(dplyr)
library(stringr)
adlb <- pharmaverseadam::adlb %>%
filter(str_detect(AVISIT, "^(Baseline|Week)")) %>%
mutate(AVISITN = if_else(AVISIT == "Baseline", 0L, as.integer(str_extract(AVISIT, "[0-9]{1,2}"))))
adae <- pharmaverseadam::adae
labs <- adlb %>%
filter(ABLFL %in% "Y") %>%
tidyr::pivot_wider(
id_cols = USUBJID,
names_from = PARAMCD,
values_from = AVAL
)
From SAS to R
Module 4: Data Visualization
Introduction
R is well known for its data visualization capabilities. While R itself ships with the built-in {graphics} package, by far the most popular package for creating statistical graphics is {ggplot2}, which builds upon a theoretical framework called the “Grammar of Graphics”, published by Leland Wilkinson in 1999.
A core concept of the grammar of graphics is that plots are built from layers. You’ll see what exactly that means throughout the examples that follow. Notice that while this document is structured by visualization type, e.g. bar chart, the concepts underlying these visualizations apply, for the most part, to any type of plot you want to create.
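To make the idea of layers concrete, here is a minimal sketch. It uses R’s built-in mtcars data as a stand-in (an assumption; it is not the dataset used in this module). Each geom_*() call added with + appends one entry to the plot object’s list of layers.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the lab data used later on
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +               # first layer: the raw points
  geom_smooth(method = "lm")   # second layer: a regression line on top

# The plot object keeps its layers in the order they were added
n_layers <- length(p$layers)
n_layers
```

Layers are drawn in the order in which they are stored, so later layers end up on top of earlier ones.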
For illustration purposes we’ll use a CDISC ADLB dataset containing various lab values measured over the course of a clinical trial.
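The setup code at the top of this module keeps the baseline records (ABLFL equal to "Y") and pivots them into one row per subject with one column per lab parameter. The sketch below reproduces that step on a tiny made-up dataset; toy_adlb and its values are hypothetical and only mimic the relevant ADLB columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of a long lab dataset: one row per subject and parameter
toy_adlb <- tibble(
  USUBJID = c("S1", "S1", "S2", "S2"),
  PARAMCD = c("HGB", "HCT", "HGB", "HCT"),
  ABLFL   = "Y",
  AVAL    = c(8.1, 0.42, 7.4, 0.39)
)

# Same steps as in the setup code: keep baseline rows, then go from long to wide
toy_labs <- toy_adlb %>%
  filter(ABLFL %in% "Y") %>%
  pivot_wider(id_cols = USUBJID, names_from = PARAMCD, values_from = AVAL)
toy_labs
```

The wide layout is what allows two lab parameters, such as HGB and HCT, to be mapped to the x and y aesthetics of the same plot.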
Scatter Plot
Regardless of the type of data visualization you want to create with {ggplot2} the first step is always calling the ggplot() function.
ggplot()
Called on its own, it doesn’t do much: it merely prints an empty canvas with a gray background. Typically you’d call ggplot() and provide inputs to the data and mapping arguments. The former is the dataset you want to plot and the latter maps columns from that dataset to aesthetics.
ggplot(data = labs, mapping = aes(x = HGB, y = HCT))
Just by providing the x and y aesthetics, ggplot() adds an x and a y axis with limits automatically determined from the data. To actually draw data points we have to add geom_point() to the output of ggplot() using the + operator.
ggplot(data = labs, mapping = aes(x = HGB, y = HCT)) +
geom_point()
To change the point color we can pass a value to the color parameter of geom_point().
ggplot(data = labs, mapping = aes(x = HGB, y = HCT)) +
geom_point(color = "royalblue")
If you want to change the limits of the x and y axes you can use the xlim() and ylim() functions, respectively.
ggplot(data = labs, mapping = aes(x = HGB, y = HCT)) +
geom_point(color = "royalblue") +
xlim(6, 12) +
ylim(0.3, 0.6)
We can use the labs() (short for “labels”) function to change the axis labels.
ggplot(data = labs, mapping = aes(x = HGB, y = HCT)) +
geom_point(color = "royalblue") +
xlim(6, 12) +
ylim(0.3, 0.6) +
labs(
x = "Hemoglobin",
y = "Hematocrit"
)
We can easily add another layer on top of the points, e.g. a smoothing line, using geom_smooth().
ggplot(data = labs, mapping = aes(x = HGB, y = HCT)) +
geom_point(color = "royalblue") +
geom_smooth() +
xlim(6, 12) +
ylim(0.3, 0.6) +
labs(
x = "Hemoglobin",
y = "Hematocrit"
)
By default geom_smooth() picks the smoothing method based on the size of the data: for datasets with fewer than 1,000 observations it fits a LOESS (local polynomial regression) model. You can change that using the method parameter; "lm" fits a linear regression model. Just like geom_point(), geom_smooth() has a color parameter, which changes the line color.
ggplot(data = labs, mapping = aes(x = HGB, y = HCT)) +
geom_point(color = "royalblue") +
geom_smooth(color = "darkorange", method = "lm") +
xlim(6, 12) +
ylim(0.3, 0.6) +
labs(
x = "Hemoglobin",
y = "Hematocrit"
)
In the example above we called geom_point() first and geom_smooth() second. Thus, the points form the first (bottom) layer and the smoothing line the second (top) layer, drawn over the points. By swapping the order of these two calls we can plot the points on top of the line.
ggplot(data = labs, mapping = aes(x = HGB, y = HCT)) +
geom_smooth(color = "darkorange", method = "lm") +
geom_point(color = "royalblue") +
xlim(6, 12) +
ylim(0.3, 0.6) +
labs(
x = "Hemoglobin",
y = "Hematocrit"
)
By default geom_smooth() adds a 95% confidence band around the regression line. To suppress it, set se = FALSE.
ggplot(data = labs, mapping = aes(x = HGB, y = HCT)) +
geom_smooth(
color = "darkorange",
method = "lm",
se = FALSE
) +
geom_point(color = "royalblue") +
xlim(6, 12) +
ylim(0.3, 0.6) +
labs(
x = "Hemoglobin",
y = "Hematocrit"
)
If you want the regression line to extend from the lower to the upper limit of the x axis rather than just across the range of the data points used to fit it, set fullrange = TRUE.
ggplot(data = labs, mapping = aes(x = HGB, y = HCT)) +
geom_smooth(
color = "darkorange",
method = "lm",
se = FALSE,
fullrange = TRUE
) +
geom_point(color = "royalblue") +
xlim(6, 12) +
ylim(0.3, 0.6) +
labs(
x = "Hemoglobin",
y = "Hematocrit"
)
To change the point size and line width you can use the size and linewidth parameters, respectively. The unit for both is millimeters (mm).
ggplot(data = labs, mapping = aes(x = HGB, y = HCT)) +
geom_smooth(
color = "darkorange",
method = "lm",
se = FALSE,
fullrange = TRUE,
linewidth = 2
) +
geom_point(color = "royalblue", size = 2) +
xlim(6, 12) +
ylim(0.3, 0.6) +
labs(
x = "Hemoglobin",
y = "Hematocrit"
)
Line Charts
To create a line chart we’ll first filter the lab dataset down to a single parameter (hemoglobin) for a single patient in the trial (more complex line plots follow further down).
set.seed(7L)
hgb_1015 <- adlb %>%
filter(PARAM == "Hemoglobin (mmol/L)", USUBJID == "01-701-1015")
Creating a line chart is not much different from creating a scatter plot. You still have to provide the x and y aesthetics, but instead of calling geom_point() you call geom_line().
ggplot(data = hgb_1015, mapping = aes(x = ADY, y = AVAL)) +
geom_line()
We can add a plot title using the title parameter of the labs() function and change the line color and width as discussed above.
ggplot(data = hgb_1015, mapping = aes(x = ADY, y = AVAL)) +
geom_line(color = "steelblue", linewidth = 1.2) +
labs(
title = paste("Subject", unique(hgb_1015$USUBJID)),
x = "Study Day",
y = unique(hgb_1015$PARAM)
)
To change the appearance of the plot we can use a different theme. The default, with a light gray background and white grid lines, is called theme_gray(); here we switch to theme_minimal().
ggplot(data = hgb_1015, mapping = aes(x = ADY, y = AVAL)) +
geom_line(color = "steelblue", linewidth = 1.2) +
labs(
title = paste("Subject", unique(hgb_1015$USUBJID)),
x = "Study Day",
y = unique(hgb_1015$PARAM)
) +
theme_minimal()
These are the themes that ship with the {ggplot2} package. Many more are available in other packages, e.g. {ggthemes}, and you can always modify existing themes or create your own.
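theme_bw(), theme_classic(), theme_dark(), theme_gray() (alias theme_grey()), theme_light(), theme_linedraw(), theme_minimal(), theme_test(), theme_void()
The remaining theme_*() functions, namely theme_get(), theme_set(), theme_update(), and theme_replace(), are not themes themselves but helpers for reading and modifying the currently active theme.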
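As a rough sketch of what modifying an existing theme can look like, the function below starts from theme_minimal() and overrides two elements. The name theme_study() and the chosen settings are hypothetical, and mtcars again stands in for the lab data.

```r
library(ggplot2)

# Hypothetical house theme: start from a complete theme, then override elements
theme_study <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      legend.position = "bottom",
      panel.grid.minor = element_blank()
    )
}

# Use it like any built-in theme
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_study()
```

Wrapping the adjustments in a function keeps the styling in one place, so every plot in a report can share it.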
So far we’ve only plotted a single line for a single patient. You can easily plot multiple lines for multiple patients on the same plot by setting the color aesthetic to the subject identifier column, i.e. USUBJID.
hgb_subset <- adlb %>%
filter(
PARAM == "Hemoglobin (mmol/L)",
USUBJID %in% sample(unique(adlb$USUBJID), 5L)
) %>%
select(USUBJID, ARM, PARAM, ADY, AVISIT, AVISITN, AVAL, CHG, PCHG)
ggplot(
data = hgb_subset,
aes(x = ADY, y = AVAL, color = USUBJID)
) +
geom_line()
The legend position is a property of the theme you are using. You can change it, and any other theme property (there are a lot!), using the theme() function.
ggplot(
data = hgb_subset,
aes(x = ADY, y = AVAL, color = USUBJID)
) +
geom_line() +
theme(legend.position = "bottom")
To suppress the legend title, call the appropriate scale function, in this case scale_color_discrete(), and set the name parameter to NULL.
ggplot(
data = hgb_subset,
aes(x = ADY, y = AVAL, color = USUBJID)
) +
geom_line() +
scale_color_discrete(name = NULL) +
theme(legend.position = "bottom")
To change the color palette you can call a different scale function. A popular one is scale_color_brewer(), which lets you pick one of the color palettes from the {RColorBrewer} package.
ggplot(
data = hgb_subset,
aes(x = ADY, y = AVAL, color = USUBJID)
) +
geom_line() +
scale_color_brewer(name = NULL, palette = "Set2") +
theme(legend.position = "bottom")If you do not want to use an existing color palette but specify the colors for each patient individually you can use the scale_color_manual() function and pass a named vector of colors to its values parameter. The names of that vector should be values within the column passed to the color aesthetic (USUBJID in this example) and the values valid color identifiers. Those can be one of R’s build in color strings, a hex code, or RGB value. To list the former call colors().
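The three kinds of color specification are interchangeable. As a small sketch using only base R, the same orange can be written as a name, as an rgb() call, or as a hex code:

```r
# Three ways of writing the same orange
from_name <- "orange"
from_rgb  <- rgb(255, 165, 0, maxColorValue = 255)
from_hex  <- "#FFA500"

# col2rgb() converts any valid color specification to its RGB components
same_color <- all(col2rgb(from_name) == col2rgb(from_rgb)) &&
  all(col2rgb(from_rgb) == col2rgb(from_hex))

# Built-in color names can be searched like any character vector
blues <- grep("steelblue", colors(), value = TRUE)
```

Searching colors() with grep() is a quick way to discover variants of a color, such as the numbered shades of steelblue.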
colors <- c(
"01-702-1082" = "steelblue",
"01-706-1049" = rgb(255, 165, 0, maxColorValue = 255), # orange
"01-710-1060" = "pink",
"01-716-1024" = "darkgreen",
"01-716-1151" = "#7f7f7f" # gray50
)
ggplot(
data = hgb_subset,
aes(x = ADY, y = AVAL, color = USUBJID)
) +
geom_line() +
scale_color_manual(name = NULL, values = colors) +
theme_classic() +
theme(legend.position = "bottom")
So far we haven’t used any statistical transformations when creating plots. In both the scatter plot and line chart examples we plotted the values exactly as they appear in the data. {ggplot2} can do more, though. Each geom_*() function has a stat parameter which specifies the transformation to apply to the data. For geom_point() and geom_line() this is "identity" by default, i.e. the data are not transformed. A useful example of applying a transformation is plotting the mean for each treatment arm in the dataset by setting stat = "summary" and fun = "mean".
hgb_all <- adlb %>%
filter(
PARAM == "Hemoglobin (mmol/L)",
str_detect(AVISIT, "Base|Week")
)
ggplot(
data = hgb_all,
mapping = aes(x = AVISIT, y = AVAL, color = ARM)
) +
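geom_line(stat = "summary", fun = "mean")
This didn’t quite work. Because the x aesthetic is discrete, {ggplot2} treats every combination of visit and arm as its own group, so each group consists of a single point and no lines can be drawn. In addition to mapping the color aesthetic to ARM we have to do the same with the group aesthetic.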
ggplot(
data = hgb_all,
mapping = aes(x = AVISIT, y = AVAL, color = ARM, group = ARM)
) +
geom_line(stat = "summary", fun = "mean")
This looks much better. However, notice that the visits are not in the correct order because they are sorted alphanumerically by default. Thus, "Week 16" occurs before "Week 2". A neat way to change this is to turn AVISIT from a character vector into a factor whose levels are sorted by a corresponding numeric value, using the reorder() function. When following CDISC standards you typically have an AVISITN column, which is just what we need.
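A small sketch of what reorder() does to the factor levels, using made-up visit values (hypothetical data, not taken from ADLB):

```r
visit     <- c("Week 16", "Baseline", "Week 2", "Week 16", "Week 2")
visit_num <- c(16, 0, 2, 16, 2)

# Default: alphanumeric level order
levels(factor(visit))

# reorder() sorts the levels by the (mean) numeric value attached to each label
visit_ordered <- reorder(visit, visit_num)
levels(visit_ordered)
```

Since {ggplot2} places discrete values along an axis in the order of their factor levels, reordering the levels is all that is needed to fix the axis.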
ggplot(
data = hgb_all,
mapping = aes(x = reorder(AVISIT, AVISITN), y = AVAL, color = ARM, group = ARM)
) +
geom_line(stat = "summary", fun = "mean")
While the visits are now ordered correctly, notice that the distance between adjacent visits on the x axis is always the same. However, there’s a gap of two weeks between "Week 4" and "Week 6" but a gap of four weeks between "Week 20" and "Week 24". To have the x axis reflect the actual time between visits, use AVISITN rather than AVISIT as the x aesthetic and pass a function that returns a character label for each numeric value in AVISITN to the labels parameter of scale_x_continuous().
label_visits <- function(visit_num) {
# visit_num is numeric, so compare against the number 0 rather than the string "0"
if_else(visit_num == 0, "BL", paste0("W", visit_num))
}
ggplot(
data = hgb_all,
mapping = aes(x = AVISITN, y = AVAL, color = ARM, group = ARM)
) +
geom_line(stat = "summary", fun = "mean") +
scale_x_continuous(
breaks = unique(hgb_all$AVISITN),
labels = label_visits
)
Rather than showing the mean lines of all three treatment arms in the same panel, we can create small multiples using the facet_wrap() function.
ggplot(
data = hgb_all,
mapping = aes(x = AVISITN, y = AVAL, color = ARM, group = ARM)
) +
geom_line(stat = "summary", fun = "mean") +
scale_x_continuous(
breaks = unique(hgb_all$AVISITN),
labels = label_visits
) +
facet_wrap(vars(ARM))
Having a small multiple for each treatment arm makes the legend redundant so let’s remove it.
ggplot(
data = hgb_all,
mapping = aes(x = AVISITN, y = AVAL, color = ARM, group = ARM)
) +
geom_line(stat = "summary", fun = "mean") +
scale_x_continuous(
breaks = unique(hgb_all$AVISITN),
labels = label_visits
) +
facet_wrap(vars(ARM)) +
theme(legend.position = "none")
Labeling every single visit on the x axis overcrowds it, so let’s choose a subset of visits to display. To do so, pass a vector of the visits to display to the breaks parameter of scale_x_continuous(). Notice that these should be the actual values from the data, i.e. numbers, rather than the labels, which are the result of applying the function passed to the labels parameter.
ggplot(
data = hgb_all,
mapping = aes(x = AVISITN, y = AVAL, color = ARM, group = ARM)
) +
geom_line(stat = "summary", fun = "mean") +
scale_x_continuous(
breaks = c(0, 8, 16, 26),
labels = label_visits
) +
facet_wrap(vars(ARM)) +
theme(legend.position = "none")
If you want to create a two-dimensional grid use facet_grid() instead.
ggplot(
data = hgb_all,
mapping = aes(x = AVISITN, y = AVAL, color = ARM, group = ARM)
) +
geom_line(stat = "summary", fun = "mean") +
scale_x_continuous(
breaks = c(0, 8, 16, 26),
labels = label_visits
) +
facet_grid(vars(SEX), vars(ARM)) +
theme(legend.position = "none")
Notice how easy it was to get the mean per treatment arm stratified by sex without having to do any data pre-processing.
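For comparison, this is roughly the pre-processing that stat = "summary" with fun = "mean" saves you from writing. The sketch uses a made-up dataset; toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two arms, two visits, two subjects per cell
toy <- tibble(
  ARM     = rep(c("Placebo", "Active"), each = 4),
  AVISITN = rep(c(0, 0, 8, 8), times = 2),
  AVAL    = c(8, 9, 7, 8, 8, 10, 9, 11)
)

# One mean per group and x value, which is what the summary stat computes
toy_means <- toy %>%
  group_by(ARM, AVISITN) %>%
  summarize(mean_aval = mean(AVAL), .groups = "drop")
toy_means
```

Plotting toy_means with geom_line() and the default stat = "identity" would give the same lines as plotting toy with stat = "summary".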
For the time being let’s return to the one-dimensional grid created by facet_wrap() and add the 95% confidence interval for the mean around the line using geom_ribbon(). This requires us to use stat = "summary" again, but this time the summary function should return three values: the mean and the lower and upper limits of the 95% confidence interval. Fortunately, the mean_cl_normal() function does just that (it requires the {Hmisc} package to be installed). Notice, though, that we pass it to fun.data instead of fun.
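To show what such a function has to return, here is a hand-written stand-in that computes the mean and a t-based 95% confidence interval. mean_ci() is a hypothetical helper for illustration, not part of {ggplot2}, and the input values are made up.

```r
# Hand-rolled stand-in for mean_cl_normal(): mean with a t-based confidence interval
mean_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  half_width <- qt((1 + conf_level) / 2, df = n - 1) * sd(x) / sqrt(n)
  # fun.data expects a data frame with the columns y, ymin, and ymax
  data.frame(y = mean(x), ymin = mean(x) - half_width, ymax = mean(x) + half_width)
}

ci <- mean_ci(c(7.9, 8.4, 8.1, 8.8, 7.6))
ci
```

Because it returns the columns y, ymin, and ymax, a function like this can be passed to fun.data in the same way as mean_cl_normal.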
ggplot(
data = hgb_all,
mapping = aes(x = AVISITN, y = AVAL, color = ARM, group = ARM)
) +
geom_ribbon(stat = "summary", fun.data = mean_cl_normal) +
geom_line(stat = "summary", fun = "mean") +
scale_x_continuous(
breaks = c(0, 8, 16, 26),
labels = label_visits
) +
facet_wrap(vars(ARM)) +
theme(legend.position = "none")
That doesn’t look quite right… The color aesthetic only applies to the outline of the ribbon created by geom_ribbon(). To change the color of its interior we have to specify the fill aesthetic. We also make the ribbon semi-transparent using the alpha parameter.
ggplot(
data = hgb_all,
mapping = aes(x = AVISITN, y = AVAL, color = ARM, group = ARM, fill = ARM)
) +
geom_ribbon(stat = "summary", fun.data = mean_cl_normal, alpha = .2) +
geom_line(stat = "summary", fun = "mean") +
scale_x_continuous(
breaks = c(0, 8, 16, 26),
labels = label_visits
) +
facet_wrap(vars(ARM)) +
theme(legend.position = "none")
Finally, to get rid of the outer lines of the ribbon you can set linewidth = 0.
ggplot(
data = hgb_all,
mapping = aes(x = AVISITN, y = AVAL, color = ARM, group = ARM, fill = ARM)
) +
geom_ribbon(
stat = "summary",
fun.data = mean_cl_normal,
alpha = .2,
linewidth = 0
) +
geom_line(stat = "summary", fun = "mean") +
scale_x_continuous(
breaks = c(0, 8, 16, 26),
labels = label_visits
) +
facet_wrap(vars(ARM)) +
theme(legend.position = "none")
Bar Charts
To illustrate the creation of bar charts we’ll plot the number of subjects with an abnormally low hemoglobin measurement at selected visits throughout the trial. To do so we’ll use the geom_bar() function. Unlike the other geoms we’ve used so far, this geom only requires an x aesthetic. The count at each x value is calculated automatically because geom_bar()’s default value for the stat parameter is "count".
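The counting can be made visible with layer_data(), which returns the data of a layer after its stat has been applied. A minimal sketch on made-up data (the toy data frame is hypothetical):

```r
library(ggplot2)

# Three records: two at baseline, one at week 8
toy <- data.frame(visit = c("Baseline", "Baseline", "Week 8"))

p <- ggplot(toy, aes(x = visit)) +
  geom_bar()

# The computed variable count holds the number of rows per x value
bar_counts <- layer_data(p)$count
bar_counts
```

The bar heights come from this computed count column rather than from any column in the input data.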
hgb_below_lln <- hgb_all %>%
filter(LBNRIND == "LOW", AVISITN %in% c(0, 8, 16, 24)) %>%
mutate(AVISIT = reorder(AVISIT, AVISITN))
ggplot(hgb_below_lln, aes(x = AVISIT)) +
geom_bar()
To stratify the bars by treatment arm we can use the fill aesthetic.
ggplot(hgb_below_lln, aes(x = AVISIT, fill = ARM)) +
geom_bar()
To display the bars side-by-side rather than stacked set position = "dodge".
ggplot(hgb_below_lln, aes(x = AVISIT, fill = ARM)) +
geom_bar(position = "dodge")
Notice that by default {ggplot2} pads continuous axes by 5% on either side, so there’s a “gap” between the bottom of the bars and the axis labels, which looks odd. You can change that using the expand parameter of scale_y_continuous(). While we’re at it, let’s also move the legend to the top and change the color palette.
ggplot(hgb_below_lln, aes(x = AVISIT, fill = ARM)) +
geom_bar(position = "dodge") +
scale_y_continuous(expand = expansion(mult = 0)) +
scale_fill_manual(
name = NULL,
values = c(
"Placebo" = "lightgray",
"Xanomeline Low Dose" = "lightblue",
"Xanomeline High Dose" = "royalblue"
)
) +
theme(legend.position = "top")
To switch the order of the bars we have to turn ARM into a factor and specify the levels in the desired order.
hgb_below_lln$ARM <- factor(
hgb_below_lln$ARM,
levels = c("Placebo", "Xanomeline Low Dose", "Xanomeline High Dose")
)
ggplot(hgb_below_lln, aes(x = AVISIT, fill = ARM)) +
geom_bar(position = "dodge") +
scale_y_continuous(expand = expansion(mult = 0)) +
scale_fill_manual(
name = NULL,
values = c(
"Placebo" = "lightgray",
"Xanomeline Low Dose" = "lightblue",
"Xanomeline High Dose" = "royalblue"
)
) +
theme(legend.position = "top")
When displaying counts it’s odd to have decimal numbers appear in the y axis labels. Let’s set different breaks.
ggplot(hgb_below_lln, aes(x = AVISIT, fill = ARM)) +
geom_bar(position = "dodge") +
scale_y_continuous(
expand = expansion(mult = 0),
breaks = seq(from = 0, to = 10, by = 2),
limits = c(0, 10)
) +
scale_fill_manual(
name = NULL,
values = c(
"Placebo" = "lightgray",
"Xanomeline Low Dose" = "lightblue",
"Xanomeline High Dose" = "royalblue"
)
) +
theme(legend.position = "top")
Finally, let’s adjust the theme to only display horizontal grid lines at the y axis ticks.
ggplot(hgb_below_lln, aes(x = AVISIT, fill = ARM)) +
geom_bar(position = "dodge") +
scale_y_continuous(
expand = expansion(mult = 0),
breaks = seq(from = 0, to = 10, by = 2),
limits = c(0, 10)
) +
scale_fill_manual(
name = NULL,
values = c(
"Placebo" = "lightgray",
"Xanomeline Low Dose" = "lightblue",
"Xanomeline High Dose" = "royalblue"
)
) +
theme_minimal() +
theme(
legend.position = "top",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
)
Next, instead of the absolute count, let’s plot on the y axis the percentage of subjects with an abnormal value relative to the number of subjects who have a measurement at that particular visit. To make this work we’ll have to do some data pre-processing.
big_n <- hgb_all %>%
group_by(ARM, AVISIT, AVISITN) %>%
summarize(N = n(), .groups = "drop")
hgb_below_lln_pct <- hgb_below_lln %>%
group_by(ARM, AVISIT, AVISITN) %>%
summarize(n = n(), .groups = "drop") %>%
left_join(big_n, by = join_by(ARM, AVISIT, AVISITN)) %>%
mutate(pct = n / N)
Since we have calculated the value to display on the y axis ourselves we have to set the y aesthetic accordingly.
ggplot(
data = hgb_below_lln_pct,
mapping = aes(x = AVISIT, y = pct, fill = ARM)
) +
geom_bar()
Error in `geom_bar()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `setup_params()`:
! `stat_count()` must only have an x or y aesthetic.
That doesn’t work, though, because we’re still using the default stat = "count" inside geom_bar(). Let’s use the "identity" transformation instead.
ggplot(
data = hgb_below_lln_pct,
mapping = aes(x = AVISIT, y = pct, fill = ARM)
) +
geom_bar(stat = "identity")
Next, let’s apply some of the adjustments we made for the first bar chart.
ggplot(hgb_below_lln_pct, aes(x = AVISIT, y = pct, fill = ARM)) +
geom_bar(position = "dodge", stat = "identity") +
scale_y_continuous(expand = expansion(mult = 0)) +
scale_fill_manual(
name = NULL,
values = c(
"Placebo" = "lightgray",
"Xanomeline Low Dose" = "lightblue",
"Xanomeline High Dose" = "royalblue"
)
) +
theme_minimal() +
theme(
legend.position = "top",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
)
This looks almost perfect, but notice that the y axis labels are fractions rather than percentages. To change this we can pass a custom function to the labels parameter of scale_y_continuous().
scale_pct <- function(x) {
paste0(x * 100, "%")
}
ggplot(hgb_below_lln_pct, aes(x = AVISIT, y = pct, fill = ARM)) +
geom_bar(position = "dodge", stat = "identity") +
scale_y_continuous(
expand = expansion(mult = 0),
labels = scale_pct
) +
scale_fill_manual(
name = NULL,
values = c(
"Placebo" = "lightgray",
"Xanomeline Low Dose" = "lightblue",
"Xanomeline High Dose" = "royalblue"
)
) +
theme_minimal() +
theme(
legend.position = "top",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
)
Another example of a bar chart is plotting the top 10 most frequently occurring adverse events across the two treatment arms receiving the experimental drug. Here’s the step-by-step code.
top_10_pt <- adae %>%
filter(str_detect(ARM, "Xano")) %>%
group_by(USUBJID, AEDECOD) %>%
slice_head() %>%
ungroup() %>%
group_by(AEDECOD) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
slice(1:10)
ggplot(data = top_10_pt, mapping = aes(x = n, y = AEDECOD)) +
geom_bar(stat = "identity")
# Order the bars by frequency
ggplot(data = top_10_pt, mapping = aes(x = n, y = reorder(AEDECOD, n))) +
geom_bar(stat = "identity")
# Remove the padding between the axis and the start of the bars
ggplot(data = top_10_pt, mapping = aes(x = n, y = reorder(AEDECOD, n))) +
geom_bar(stat = "identity") +
scale_x_continuous(expand = expansion(mult = c(0, .05)))
# Change the bar color
ggplot(data = top_10_pt, mapping = aes(x = n, y = reorder(AEDECOD, n))) +
geom_bar(stat = "identity", fill = "darkgreen") +
scale_x_continuous(expand = expansion(mult = c(0, .05)))
# Add a title and drop the axis titles
ggplot(data = top_10_pt, mapping = aes(x = n, y = reorder(AEDECOD, n))) +
geom_bar(stat = "identity", fill = "darkgreen") +
scale_x_continuous(expand = expansion(mult = c(0, .05))) +
labs(
x = NULL,
y = NULL,
title = "Top 10 Most Frequent Adverse Events"
)
# Title-case the event terms, label the x axis, and switch to theme_classic()
ggplot(data = top_10_pt, mapping = aes(x = n, y = reorder(AEDECOD, n))) +
geom_bar(stat = "identity", fill = "darkgreen") +
scale_x_continuous(expand = expansion(mult = c(0, .05))) +
scale_y_discrete(labels = str_to_title) +
labs(
x = "Number of Subjects with at Least One AE",
y = NULL,
title = "Top 10 Most Frequent Adverse Events"
) +
theme_classic()
# Remove the axis lines and ticks and add vertical grid lines
ggplot(data = top_10_pt, mapping = aes(x = n, y = reorder(AEDECOD, n))) +
geom_bar(stat = "identity", fill = "darkgreen") +
scale_x_continuous(expand = expansion(mult = c(0, .05))) +
scale_y_discrete(labels = str_to_title) +
labs(
x = "Number of Subjects With at Least One AE",
y = NULL,
title = "Top 10 Most Frequent Adverse Events"
) +
theme_classic() +
theme(
axis.ticks = element_blank(),
axis.line = element_blank(),
panel.grid.major.x = element_line()
)
Box Plots
Given all the knowledge you’ve gained from the examples so far, creating a box plot should feel rather easy. Box plots require the x and y aesthetics and, if you want to stratify the values at each x value by a categorical variable, the color (or fill) aesthetic in addition. Once these mappings are specified you just have to add a call to geom_boxplot() and {ggplot2} will derive all required statistics, like the median, for you.
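To give a sense of the statistics behind a box plot, here is a base R sketch of the usual definitions (quartiles for the box, whiskers up to 1.5 times the interquartile range, everything beyond as outliers). The values are made up, and this is an illustration of the convention rather than the package’s internal code.

```r
x <- c(1, 2, 3, 4, 100)

# Box: first quartile, median, third quartile
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Whiskers reach the most extreme values within 1.5 * IQR of the box
lower_whisker <- min(x[x >= q[1] - 1.5 * iqr])
upper_whisker <- max(x[x <= q[3] + 1.5 * iqr])

# Anything beyond the whiskers is drawn as an individual outlier point
outliers <- x[x < lower_whisker | x > upper_whisker]
```

In this example the value 100 lies far beyond the upper whisker and would therefore be drawn as a separate point.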
hgb_bl_w26 <- filter(hgb_all, AVISIT %in% c("Baseline", "Week 26"))
ggplot(
data = hgb_bl_w26,
mapping = aes(x = AVISIT, y = AVAL, color = ARM)
) +
geom_boxplot()
If you want to add a point to indicate the arithmetic mean you can add another layer created with geom_point() on top of geom_boxplot(). Note that unless you specify position = position_dodge(0.75) the points will all appear in the middle rather than inside their respective boxes.
ggplot(
data = hgb_bl_w26,
mapping = aes(x = AVISIT, y = AVAL, color = ARM)
) +
geom_boxplot() +
geom_point(
stat = "summary",
fun = "mean",
shape = "triangle",
position = position_dodge(0.75),
size = 2
)
Another useful layer to add on top of the boxes is a set of lines connecting the values of each individual subject. To do so we’ll make use of small multiples and add a layer created with geom_line(). To connect only the values belonging to the same subject, rather than all values across subjects, we have to specify the group aesthetic.
ggplot(
data = hgb_bl_w26,
mapping = aes(x = AVISIT, y = AVAL, color = ARM, group = USUBJID)
) +
geom_boxplot() +
geom_point(
stat = "summary",
fun = "mean",
shape = "triangle",
position = position_dodge(0.75),
size = 2
) +
geom_line() +
facet_wrap(~ARM)
That certainly produced some lines, but where did the box plots go and why are there so many triangles? The issue here is that aesthetics specified within ggplot() are inherited by all layers. For this plot to work, though, group should only apply to the geom_line() layer. This can be achieved by setting the group aesthetic within the geom_line() call.
ggplot(
data = hgb_bl_w26,
mapping = aes(x = AVISIT, y = AVAL, color = ARM)
) +
geom_boxplot() +
geom_point(
stat = "summary",
fun = "mean",
shape = "triangle",
position = position_dodge(0.75),
size = 2
) +
geom_line(mapping = aes(group = USUBJID), alpha = .25) +
facet_wrap(~ARM)
For the final touch, let’s remove the legend and add some descriptive axis labels.
ggplot(
data = hgb_bl_w26,
mapping = aes(x = AVISIT, y = AVAL, color = ARM)
) +
geom_boxplot() +
geom_point(
stat = "summary",
fun = "mean",
shape = "triangle",
position = position_dodge(0.75),
size = 2
) +
geom_line(aes(group = USUBJID), alpha = .25) +
facet_wrap(~ARM) +
theme(legend.position = "none") +
labs(
x = "Visit",
y = unique(hgb_all$PARAM)
)
Et voilà!
Kaplan-Meier Plot
While you could use {ggplot2} on its own to create Kaplan-Meier plots, this is somewhat cumbersome. Instead, I’d recommend you use the {ggsurvfit} package, which does a lot of the heavy lifting for you. The first step is fitting a survival model. We’ll discuss this in detail in module 6 so I won’t comment on the model fitting code here.
library(ggsurvfit)
data(adtte)
survival_model <- survfit2(Surv_CNSR() ~ STR01, data = adtte)
Once you’ve fitted a survival model, creating a Kaplan-Meier plot is as easy as calling the ggsurvfit() function with the model object as first argument.
ggsurvfit(survival_model)
If you want to display a confidence interval around the estimates use the add_confidence_interval() function.
ggsurvfit(survival_model) +
add_confidence_interval()
By including a call to add_risktable() you can have a risk table displayed underneath the plot.
ggsurvfit(survival_model) +
add_confidence_interval() +
add_risktable()
To add an indicator for the median survival times use the add_quantile() function with the y_value parameter set to 0.5.
ggsurvfit(survival_model) +
add_confidence_interval() +
add_risktable() +
add_quantile(y_value = 0.5)
The output of ggsurvfit() is a “usual” ggplot object so you can call any “regular” function from the {ggplot2} package on it.
ggsurvfit(survival_model) +
add_confidence_interval() +
add_risktable() +
add_quantile(y_value = 0.5) +
scale_x_continuous(breaks = 0:5)
Instead of displaying "At Risk" and "Events" in the rows of the risk table you can have an indicator for the stratification variable displayed. To do so, set risktable_group = "risktable_stats" inside add_risktable() and add a call to add_risktable_strata_symbol().
ggsurvfit(survival_model) +
add_confidence_interval() +
add_risktable(risktable_group = "risktable_stats") +
add_quantile(y_value = 0.5) +
add_risktable_strata_symbol() +
scale_x_continuous(breaks = 0:5)
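The median survival time marked by add_quantile(y_value = 0.5) can also be read off the fitted model as a number. As a rough sketch, the code below uses the {survival} package directly with its bundled aml dataset as a stand-in (an assumption; it is not the adtte data used above).

```r
library(survival)

# aml ships with {survival}; it stands in for the adtte data used above
fit <- survfit(Surv(time, status) ~ x, data = aml)

# summary()$table holds one row per stratum, including the median survival time
medians <- summary(fit)$table[, "median"]
medians
```

Each value is the time at which the corresponding estimated survival curve first drops to 0.5 or below, which is where the median indicator meets the curve in the plot.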